{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "hcD2nPQvPOFM"
      },
      "source": [
        "##### Copyright 2018 The TensorFlow Authors.\n",
        "\n",
        "Licensed under the Apache License, Version 2.0 (the \"License\").\n",
        "\n",
        "# Text Generation using a RNN\n",
        "\n",
        "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\u003ctd\u003e\n",
        "\u003ca target=\"_blank\"  href=\"https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/generative_examples/text_generation.ipynb\"\u003e\n",
        "    \u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e  \n",
        "\u003c/td\u003e\u003ctd\u003e\n",
        "\u003ca target=\"_blank\"  href=\"https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/examples/generative_examples/text_generation.ipynb\"\u003e\u003cimg width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on Github\u003c/a\u003e\u003c/td\u003e\u003c/table\u003e"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "BwpJ5IffzRG6"
      },
      "source": [
        "This notebook demonstrates how to generate text using an RNN using [tf.keras](https://www.tensorflow.org/programmers_guide/keras) and [eager execution](https://www.tensorflow.org/programmers_guide/eager). If you like, you can write a similar [model](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/8.1-text-generation-with-lstm.ipynb) using less code. Here, we show a lower-level impementation that's useful to understand as prework before diving in to deeper examples in a similar, like [Neural Machine Translation with Attention](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb).\n",
        "\n",
        "This notebook is an end-to-end example. When you run it, it will download a dataset of Shakespeare's writing. We'll use a collection of plays, borrowed from Andrej Karpathy's excellent [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).  The notebook will train a model, and use it to generate sample output.\n",
        "  \n",
        "Here is the output(with start string='w') after training a single layer GRU for 30 epochs with the default settings below:\n",
        "\n",
        "```\n",
        "were to the death of him\n",
        "And nothing of the field in the view of hell,\n",
        "When I said, banish him, I will not burn thee that would live.\n",
        "\n",
        "HENRY BOLINGBROKE:\n",
        "My gracious uncle--\n",
        "\n",
        "DUKE OF YORK:\n",
        "As much disgraced to the court, the gods them speak,\n",
        "And now in peace himself excuse thee in the world.\n",
        "\n",
        "HORTENSIO:\n",
        "Madam, 'tis not the cause of the counterfeit of the earth,\n",
        "And leave me to the sun that set them on the earth\n",
        "And leave the world and are revenged for thee.\n",
        "\n",
        "GLOUCESTER:\n",
        "I would they were talking with the very name of means\n",
        "To make a puppet of a guest, and therefore, good Grumio,\n",
        "Nor arm'd to prison, o' the clouds, of the whole field,\n",
        "With the admire\n",
        "With the feeding of thy chair, and we have heard it so,\n",
        "I thank you, sir, he is a visor friendship with your silly your bed.\n",
        "\n",
        "SAMPSON:\n",
        "I do desire to live, I pray: some stand of the minds, make thee remedies\n",
        "With the enemies of my soul.\n",
        "\n",
        "MENENIUS:\n",
        "I'll keep the cause of my mistress.\n",
        "\n",
        "POLIXENES:\n",
        "My brother Marcius!\n",
        "\n",
        "Second Servant:\n",
        "Will't ple\n",
        "```\n",
        "\n",
        "Of course, while some of the sentences are grammatical, most do not make sense. But, consider:\n",
        "\n",
        "* Our model is character based (when we began training, it did not yet know how to spell a valid English word, or that words were even a unit of text).\n",
        "\n",
        "* The structure of the output resembles a play (blocks begin with a speaker name, in all caps similar to the original text). Sentences generally end with a period. If you look at the text from a distance (or don't read the invididual words too closely, it appears as if it's an excerpt from a play).\n",
        "\n",
        "As a next step, you can experiment training the model on a different dataset - any large text file(ASCII) will do, and you can modify a single line of code below to make that change. Have fun!\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "R3p22DBDsaCA"
      },
      "source": [
        "## Install unidecode library\n",
        "A helpful library to convert unicode to ASCII."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "wZ6LOM12wKGH"
      },
      "outputs": [],
      "source": [
        "!pip install unidecode"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "WGyKZj3bzf9p"
      },
      "source": [
        "## Import tensorflow and enable eager execution."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "yG_n40gFzf9s"
      },
      "outputs": [],
      "source": [
        "# Import TensorFlow \u003e= 1.10 and enable eager execution\n",
        "import tensorflow as tf\n",
        "\n",
        "# Note: Once you enable eager execution, it cannot be disabled. \n",
        "tf.enable_eager_execution()\n",
        "\n",
        "import numpy as np\n",
        "import os\n",
        "import re\n",
        "import random\n",
        "import unidecode\n",
        "import time"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "EHDoRoc5PKWz"
      },
      "source": [
        "## Download the dataset\n",
        "\n",
        "In this example, we will use the [shakespeare dataset](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt). You can use any other dataset that you like.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "pD_55cOxLkAb"
      },
      "outputs": [],
      "source": [
        "path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "UHjdCjDuSvX_"
      },
      "source": [
        "## Read the dataset\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "-E5JvY3wzf94"
      },
      "outputs": [],
      "source": [
        "text = unidecode.unidecode(open(path_to_file).read())\n",
        "# length of text is the number of characters in it\n",
        "print (len(text))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Il9ww98izf-D"
      },
      "source": [
        "Creating dictionaries to map from characters to their indices and vice-versa, which will be used to vectorize the inputs"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "IalZLbvOzf-F"
      },
      "outputs": [],
      "source": [
        "# unique contains all the unique characters in the file\n",
        "unique = sorted(set(text))\n",
        "\n",
        "# creating a mapping from unique characters to indices\n",
        "char2idx = {u:i for i, u in enumerate(unique)}\n",
        "idx2char = {i:u for i, u in enumerate(unique)}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "1v_qUYfAzf-I"
      },
      "outputs": [],
      "source": [
        "# setting the maximum length sentence we want for a single input in characters\n",
        "max_length = 100\n",
        "\n",
        "# length of the vocabulary in chars\n",
        "vocab_size = len(unique)\n",
        "\n",
        "# the embedding dimension \n",
        "embedding_dim = 256\n",
        "\n",
        "# number of RNN (here GRU) units\n",
        "units = 1024\n",
        "\n",
        "# batch size \n",
        "BATCH_SIZE = 64\n",
        "\n",
        "# buffer size to shuffle our dataset\n",
        "BUFFER_SIZE = 10000"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "LFjSVAlWzf-N"
      },
      "source": [
        "## Creating the input and output tensors\n",
        "\n",
        "Vectorizing the input and the target text because our model cannot understand strings only numbers.\n",
        "\n",
        "But first, we need to create the input and output vectors.\n",
        "Remember the max_length we set above, we will use it here. We are creating **max_length** chunks of input, where each input vector is all the characters in that chunk except the last and the target vector is all the characters in that chunk except the first.\n",
        "\n",
        "For example, consider that the string = 'tensorflow' and the max_length is 9\n",
        "\n",
        "So, the `input = 'tensorflo'` and `output = 'ensorflow'`\n",
        "\n",
        "After creating the vectors, we convert each character into numbers using the **char2idx** dictionary we created above."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "0UHJDA39zf-O"
      },
      "outputs": [],
      "source": [
        "input_text = []\n",
        "target_text = []\n",
        "\n",
        "for f in range(0, len(text)-max_length, max_length):\n",
        "    inps = text[f:f+max_length]\n",
        "    targ = text[f+1:f+1+max_length]\n",
        "\n",
        "    input_text.append([char2idx[i] for i in inps])\n",
        "    target_text.append([char2idx[t] for t in targ])\n",
        "    \n",
        "print (np.array(input_text).shape)\n",
        "print (np.array(target_text).shape)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "MJdfPmdqzf-R"
      },
      "source": [
        "## Creating batches and shuffling them using tf.data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "p2pGotuNzf-S"
      },
      "outputs": [],
      "source": [
        "dataset = tf.data.Dataset.from_tensor_slices((input_text, target_text)).shuffle(BUFFER_SIZE)\n",
        "dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "m8gPwEjRzf-Z"
      },
      "source": [
        "## Creating the model\n",
        "\n",
        "We use the Model Subclassing API which gives us full flexibility to create the model and change it however we like. We use 3 layers to define our model.\n",
        "\n",
        "* Embedding layer\n",
        "* GRU layer (you can use an LSTM layer here)\n",
        "* Fully connected layer"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "P3KTiiInzf-a"
      },
      "outputs": [],
      "source": [
        "class Model(tf.keras.Model):\n",
        "  def __init__(self, vocab_size, embedding_dim, units, batch_size):\n",
        "    super(Model, self).__init__()\n",
        "    self.units = units\n",
        "    self.batch_sz = batch_size\n",
        "\n",
        "    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)\n",
        "\n",
        "    if tf.test.is_gpu_available():\n",
        "      self.gru = tf.keras.layers.CuDNNGRU(self.units, \n",
        "                                          return_sequences=True, \n",
        "                                          return_state=True, \n",
        "                                          recurrent_initializer='glorot_uniform')\n",
        "    else:\n",
        "      self.gru = tf.keras.layers.GRU(self.units, \n",
        "                                     return_sequences=True, \n",
        "                                     return_state=True, \n",
        "                                     recurrent_activation='sigmoid', \n",
        "                                     recurrent_initializer='glorot_uniform')\n",
        "\n",
        "    self.fc = tf.keras.layers.Dense(vocab_size)\n",
        "        \n",
        "  def call(self, x, hidden):\n",
        "    x = self.embedding(x)\n",
        "\n",
        "    # output shape == (batch_size, max_length, hidden_size) \n",
        "    # states shape == (batch_size, hidden_size)\n",
        "\n",
        "    # states variable to preserve the state of the model\n",
        "    # this will be used to pass at every step to the model while training\n",
        "    output, states = self.gru(x, initial_state=hidden)\n",
        "\n",
        "\n",
        "    # reshaping the output so that we can pass it to the Dense layer\n",
        "    # after reshaping the shape is (batch_size * max_length, hidden_size)\n",
        "    output = tf.reshape(output, (-1, output.shape[2]))\n",
        "\n",
        "    # The dense layer will output predictions for every time_steps(max_length)\n",
        "    # output shape after the dense layer == (max_length * batch_size, vocab_size)\n",
        "    x = self.fc(output)\n",
        "\n",
        "    return x, states"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "trpqTWyvk0nr"
      },
      "source": [
        "## Call the model and set the optimizer and the loss function"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "7t2XrzEOzf-e"
      },
      "outputs": [],
      "source": [
        "model = Model(vocab_size, embedding_dim, units, BATCH_SIZE)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "dkjWIATszf-h"
      },
      "outputs": [],
      "source": [
        "optimizer = tf.train.AdamOptimizer()\n",
        "\n",
        "# using sparse_softmax_cross_entropy so that we don't have to create one-hot vectors\n",
        "def loss_function(real, preds):\n",
        "    return tf.losses.sparse_softmax_cross_entropy(labels=real, logits=preds)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "3K6s6F79P7za"
      },
      "source": [
        "## Checkpoints (Object-based saving)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "oAGisDdfP9rL"
      },
      "outputs": [],
      "source": [
        "checkpoint_dir = './training_checkpoints'\n",
        "checkpoint_prefix = os.path.join(checkpoint_dir, \"ckpt\")\n",
        "checkpoint = tf.train.Checkpoint(optimizer=optimizer,\n",
        "                                 model=model)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "lPrP0XMUzf-p"
      },
      "source": [
        "## Train the model\n",
        "\n",
        "Here we will use a custom training loop with the help of GradientTape()\n",
        "\n",
        "* We initialize the hidden state of the model with zeros and shape == (batch_size, number of rnn units). We do this by calling the function defined while creating the model.\n",
        "\n",
        "* Next, we iterate over the dataset(batch by batch) and calculate the **predictions and the hidden states** associated with that input.\n",
        "\n",
        "* There are a lot of interesting things happening here.\n",
        "  * The model gets hidden state(initialized with 0), lets call that **H0** and the first batch of input, lets call that **I0**.\n",
        "  * The model then returns the predictions **P1** and **H1**.\n",
        "  * For the next batch of input, the model receives **I1** and **H1**.\n",
        "  * The interesting thing here is that we pass **H1** to the model with **I1** which is how the model learns. The context learned from batch to batch is contained in the **hidden state**.\n",
        "  * We continue doing this until the dataset is exhausted and then we start a new epoch and repeat this.\n",
        "\n",
        "* After calculating the predictions, we calculate the **loss** using the loss function defined above. Then we calculate the gradients of the loss with respect to the model variables(input)\n",
        "\n",
        "* Finally, we take a step in that direction with the help of the optimizer using the apply_gradients function.\n",
        "\n",
        "Note:- If you are running this notebook in Colab which has a **Tesla K80 GPU** it takes about 23 seconds per epoch.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "d4tSNwymzf-q"
      },
      "outputs": [],
      "source": [
        "# Training step\n",
        "\n",
        "EPOCHS = 20\n",
        "\n",
        "for epoch in range(EPOCHS):\n",
        "    start = time.time()\n",
        "    \n",
        "    # initializing the hidden state at the start of every epoch\n",
        "    hidden = model.reset_states()\n",
        "    \n",
        "    for (batch, (inp, target)) in enumerate(dataset):\n",
        "          with tf.GradientTape() as tape:\n",
        "              # feeding the hidden state back into the model\n",
        "              # This is the interesting step\n",
        "              predictions, hidden = model(inp, hidden)\n",
        "              \n",
        "              # reshaping the target because that's how the \n",
        "              # loss function expects it\n",
        "              target = tf.reshape(target, (-1,))\n",
        "              loss = loss_function(target, predictions)\n",
        "              \n",
        "          grads = tape.gradient(loss, model.variables)\n",
        "          optimizer.apply_gradients(zip(grads, model.variables))\n",
        "\n",
        "          if batch % 100 == 0:\n",
        "              print ('Epoch {} Batch {} Loss {:.4f}'.format(epoch+1,\n",
        "                                                            batch,\n",
        "                                                            loss))\n",
        "    # saving (checkpoint) the model every 5 epochs\n",
        "    if (epoch + 1) % 5 == 0:\n",
        "      checkpoint.save(file_prefix = checkpoint_prefix)\n",
        "\n",
        "    print ('Epoch {} Loss {:.4f}'.format(epoch+1, loss))\n",
        "    print('Time taken for 1 epoch {} sec\\n'.format(time.time() - start))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "01AR9vpNQMFF"
      },
      "source": [
        "## Restore the latest checkpoint"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "tyvpYomYQQkF"
      },
      "outputs": [],
      "source": [
        "# restoring the latest checkpoint in checkpoint_dir\n",
        "checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "DjGz1tDkzf-u"
      },
      "source": [
        "## Predicting using our trained model\n",
        "\n",
        "The below code block is used to generated the text\n",
        "\n",
        "* We start by choosing a start string and initializing the hidden state and setting the number of characters we want to generate.\n",
        "\n",
        "* We get predictions using the start_string and the hidden state\n",
        "\n",
        "* Then we use argmax to calculate the index of the predicted word. **We use this predicted word as our next input to the model**\n",
        "\n",
        "* **The hidden state returned by the model is fed back into the model so that it now has more context rather than just one word.** After we predict the next word, the modified hidden states are again fed back into the model, which is how it learns as it gets more context from the previously predicted words.\n",
        "\n",
        "* If you see the predictions, the model knows when to capitalize, make paragraphs and the text follows a shakespeare style of writing which is pretty awesome!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "WvuwZBX5Ogfd"
      },
      "outputs": [],
      "source": [
        "# Evaluation step(generating text using the model learned)\n",
        "\n",
        "# number of characters to generate\n",
        "num_generate = 1000\n",
        "\n",
        "# You can change the start string to experiment\n",
        "start_string = 'Q'\n",
        "# converting our start string to numbers(vectorizing!) \n",
        "input_eval = [char2idx[s] for s in start_string]\n",
        "input_eval = tf.expand_dims(input_eval, 0)\n",
        "\n",
        "# empty string to store our results\n",
        "text_generated = ''\n",
        "\n",
        "# hidden state shape == (batch_size, number of rnn units); here batch size == 1\n",
        "hidden = [tf.zeros((1, units))]\n",
        "for i in range(num_generate):\n",
        "    predictions, hidden = model(input_eval, hidden)\n",
        "\n",
        "    # using argmax to predict the word returned by the model\n",
        "    predicted_id = tf.argmax(predictions[-1]).numpy()\n",
        "    \n",
        "    # We pass the predicted word as the next input to the model\n",
        "    # along with the previous hidden state\n",
        "    input_eval = tf.expand_dims([predicted_id], 0)\n",
        "    \n",
        "    text_generated += idx2char[predicted_id]\n",
        "\n",
        "print (start_string + text_generated)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "AM2Uma_-yVIq"
      },
      "source": [
        "## Next steps\n",
        "\n",
        "* Change the start string to a different character, or the start of a sentence.\n",
        "* Experiment with training on a different, or with different parameters. [Project  Gutenberg](http://www.gutenberg.org/ebooks/100), for example, contains a large collection of books.\n",
        "* Add another RNN layer.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "gtEd86sX5cB2"
      },
      "outputs": [],
      "source": [
        ""
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [],
      "name": "text_generation.ipynb",
      "private_outputs": true,
      "provenance": [],
      "toc_visible": true,
      "version": "0.3.2"
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
